Benchmarking Storage Systems
White Paper
by
Ohad I. Unna
DynaTek Automation Systems Inc.
September, 1992
Benchmarking Storage Systems
Abstract
Existing disk benchmarking programs fall short in measuring the
performance of the storage subsystem as a whole, in the sense of
missing the forest for the trees. In order to obtain a meaningful
performance index of a storage subsystem that is congruent with the
user-perceived usefulness of that system, a benchmark has to measure
the performance in an application-level manner, simulating real-life
computing environments. A method for constructing such a benchmark is
presented in this paper.
Foreword
or
Who needs another disk benchmarking program?
It is said that if you have a watch you know what time it is, but if
you have two -- you are never sure. If you have half a dozen disk
benchmarking utilities, you probably have six different sets of
results, and there's a good chance none of these results bear any
resemblance to the way your system performs under its normal workload.
Back in the old days (1982-3) things were pretty simple - PC hard disk
drives were invariably spinning at 3600 revolutions per minute, almost
always had 17 sectors per track, and built-in bad-sector remapping and
multi-megabyte cache were practically unheard of in the microcomputer
world. When the IDE and SCSI disk interfaces were introduced to the PC
arena, things got really complicated; disk drive vendors started
employing sector-mapping schemes in order to avoid the 1024-cylinder
restriction, controllers were enhanced with a large onboard cache, and
on-the-fly disk compression adapters and software became more and more
popular, rendering most of the old disk benchmarking software useless.
If you are not familiar with terms such as "sector-mapping" and
"compression adapters," don't worry. As a user, you shouldn't really
care whether your drive has "skewed-track interleave" or "low BIOS
overhead latency," as long as you get good performance out of it.
Application Level Benchmarks
or
Why "your mileage may vary"
Have you ever been tempted to buy one of those new "effective access
time: 0.5 milliseconds" disk drives? Look before you leap: obtaining
such a figure for a current technology drive means the drive
manufacturer is using a less-than-adequate disk benchmarking program.
Such a program can easily be fooled by the presence of a small onboard
cache; by continuously reading and writing data to the same area of
the disk, the benchmark merely updates the RAM on the disk controller,
while very little disk activity actually takes place. Updating a few
kilobytes of RAM in one half of a millisecond isn't that impressive,
and this will become evident as soon as you start running real-life
applications. If you are using a software disk cache, as is now
standard on most operating systems, you will hardly notice the
presence of an additional 256KB controller-based cache, unless you
have a nonstandard configuration or device.
Actual storage system performance depends on an assortment of
parameters: bus bandwidth, cache strategy, scatter-gather algorithm,
cluster size and fragmentation, to name a few. None of these
parameters are influenced by the physical characteristics of the
specific hard drive used, and each of them has more impact on overall
storage system performance than the proverbial "average seek time."
The more advanced disk benchmarking programs try to eliminate the
effects of operating system level parameters by communicating directly
to the disk. Unfortunately, results obtained in this way are about as
useful as knowing the top RPM of a car engine when running in neutral.
Two cases present even vaguer prospects: networks and disk-arrays. In
both cases the same hardware can easily be configured to work in
different modes, producing a vast range of performance. With networks,
network topology may be redefined, and server configuration may be
altered, resulting in an overall system throughput that has little to
do with the actual performance of the drives themselves. The second
special case is disk-arrays configured as a RAID cluster; the
performance of the array as a unit can range from a small fraction of
a single drive's performance to twice that figure or more.
The experienced user will try to estimate the importance of each
performance feature of the storage system, weigh them against each
other, and make a decision based more on intuition than on hard facts.
Eventually, the user will run the ultimate benchmark: a test drive of
the system with their favorite application. As unscientific as it may
seem, the results of this test are more meaningful than any of the
standard benchmarks.
The application level benchmarking approach is defined as testing a
system in a manner that resembles actual usage as closely as possible.
This seems to be the most consistent and useful approach to the
problem of measuring storage system performance.
Shortcomings of Existing Benchmarks
or
Why results are only as good as the measurement tool
* Cache
Since most existing disk benchmarking programs were designed to
give results within a few seconds of activation, they usually
perform their tests by moving very small amounts of data. With
modern high-end storage devices you could have anywhere from a
few kilobytes to 64 megabytes or more of cache between the CPU
and the storage system. If a benchmarking program uses standard
file accesses, most or all of the data wouldn't even reach the
drive by the time the measurement period is over, yielding
ludicrous results, such as a transfer rate of 20 megabytes per
second. If the cache is embedded within the hardware, as is
typical of high-end SCSI adapters, it would present the same
problem for benchmarking software which uses direct access to the
drive. In order to measure the true effect of a cache, the
benchmark should move vast amounts of data during the test, use
huge files for disk access, and generally behave like a typical
application. A good benchmark shouldn't eliminate the cache
effect, but rather measure the true benefit an actual application
will gain by the presence of that cache.
* Compression
Disk compression hardware and utilities are new players in the
small-computer arena. Before data is written to the drive it is
compressed either by a software device driver or by a dedicated
hardware adapter. During reads, data is uncompressed back to its
original form. Best performance is achieved when the data is
highly regular, such as is common in large spreadsheets or
databases. However, as the ratio between CPU speed and disk
throughput grows, new applications will have built-in data
compression, especially those dealing with highly compressible
data, such as video and music. When this happens, the benefits of
disk-based compression will diminish. A typical disk benchmarking
program uses either fixed data blocks or whatever garbage data
happens to be in memory at the time. In either case, the data
tends to be highly compressible, resulting in exceptionally high
test scores for transfer rate. In order to obtain true results,
the benchmark should use data patterns which resemble the actual
data as closely as possible. Since this is virtually impossible,
the next best thing is to select a reference pattern which can be
easily defined and whose maximum compression ratio is known in
advance. The only data pattern that meets both of these criteria
is an incompressible pattern. Information theory assures us that
a suitable pseudorandom sequence has zero redundancy and thus
cannot be compressed. As far as the compression logic is
concerned, a pseudorandom sequence seems identical to a
pre-compressed sequence.
* Multi-threading
A drive with a long seek time may perform very well for
sequential accesses, but it will cease doing so as soon as
several threads are running concurrently, even if all threads
access the same file, and all do so sequentially. The necessity
to switch tasks results in multiple seeks up to a point where,
with a sufficiently large number of threads, the behavior of the
drive will resemble totally random access. The extent to which
this effect will degrade performance cannot be computed a priori
using only the knowledge of average random access time and
transfer rate, and thus has to be tested per se. A small number
(3-6) of concurrent threads is usually enough to feel the full
effect of sequential access degradation.
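
As a rough illustration, the following C sketch shows how such
degradation can be provoked even within a single process: several
logical "threads" read the same file sequentially, but their requests
are interleaved round-robin, forcing the drive to seek between widely
separated positions. The block size, thread count and stdio-based I/O
are illustrative assumptions, not RAIDmark internals.

    #include <stdio.h>

    #define BLOCK    8192
    #define THREADS  4

    void interleaved_read(FILE *f, long file_size)
    {
        static char buf[BLOCK];
        long pos[THREADS];
        long span = file_size / THREADS;
        int t;

        for (t = 0; t < THREADS; t++)
            pos[t] = (long)t * span;        /* each stream starts in its own region */

        for (;;) {
            int active = 0;
            for (t = 0; t < THREADS; t++) {
                if (pos[t] >= (long)(t + 1) * span)
                    continue;               /* this stream is finished */
                fseek(f, pos[t], SEEK_SET); /* forces a seek between streams */
                if (fread(buf, 1, BLOCK, f) != BLOCK) {
                    pos[t] = (long)(t + 1) * span;  /* short read: mark stream done */
                    continue;
                }
                pos[t] += BLOCK;
                active = 1;
            }
            if (!active)
                break;
        }
    }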
* Sequential vs. Random
Existing benchmarks generally divide the type of disk accesses
into two categories: sequential access and random access.
However, in real life, only very few of the applications actually
behave this way. A modern disk controller with a high bandwidth
and deferred-write caching can almost totally eliminate seek
latency for applications performing only a small portion of
random accesses. The same drive, when used with a write-through
cache, might perform almost as slowly as under fully random access. A
good benchmark should present the system with the correct mixture
of sequential and random accesses in order to get valid results.
* Combining reads and writes
Much like the mixture of sequential and random accesses, a
mixture of read and write requests may yield results which differ
substantially from the 'expected' weighted average of the two.
With the proper logic programmed in the drive controller or
device driver, and the appropriate scatter/gather method
implemented in the operating system, read and write requests can
be overlapped from the CPU's point of view, resulting in superior
throughput. Some of the more rudimentary disk benchmarking
programs do not perform any writes at all during the test, and
simply assume the transfer rate is the same as for reads. This
assumption falls apart when RAID class devices are concerned.
Depending on RAID level and architecture, the time needed to
write a block may differ from the time it takes to read the same
block by a factor of two or more. An inherently sluggish RAID
device may seem exceptionally fast when tested by only reading
existing data, or when the mixture of read and write operations
does not match the actual application mixture.
Defining a Reference Disk Benchmark
or
What's a RAIDmark?
The DynaTek RAIDmark utility is intended to measure the integrity,
performance, and cost-efficiency of any read/write random access
(blocked I/O) storage subsystem in terms that are meaningful for the
layman, and that depict the apparent usefulness the user gets from the
system. RAIDmark is also intended to measure some low-level metrics of
the storage device, which can be utilized for testing and fine-tuning
such devices throughout the phases of on-site installation.
The RAIDmark utility defines a set of six standard tests, each
simulating a typical I/O-bound computing environment. For each test,
the tested system receives a grade proportional to its throughput. A
unit of 'One RAIDmark' is defined as the measured throughput (for each
test) of an IBM PS/2 model 95 with the standard IBM 150 MB hard disk.
Integrity is tested on a go/no-go basis using 32-bit checksum
verification on each record.
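
The grading itself is a simple normalization against the reference
machine, roughly as sketched below in C; the reference throughput has
to be measured on (or supplied for) the PS/2 model 95 baseline, and no
actual figure is assumed here.

    /* A score of 1.0 RAIDmarks means the tested system matched the
     * reference machine's throughput on that particular test. */
    double raidmark_score(double measured_throughput, double reference_throughput)
    {
        return measured_throughput / reference_throughput;
    }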
* Low-level performance
The utility measures raw indexes of storage device performance:
average access time, read/write transfer rate, effective cache
size and cache strategy.
* High-level performance
The utility tests the performance of the storage subsystem on the
application level, checking the subsystem's ability to respond to
harsh real-world data traffic of typical heavy load environments.
The utility simulates data flow of several typical work
environments and measures system response and throughput in each
of the test cases, giving a combined figure of merit for the
system's performance in each scenario.
The following scenarios are simulated and tested:
- Typical workstation applications - multiuser (3-10 users)
environment, many small- to medium-size file requests, most
of which are read-only accesses with less frequent large
file read/write operations (as in CAD/CAM file save or
memory swap).
- Transaction processing environment - 5-50 users, many small
record read and update requests, reports and backup
processes running in the background, transaction logbook as
a large sequential file.
- Video/audio/multimedia applications - single or few users,
huge sequential access files, with both read and write
requests.
- Database query systems - 5-200 users, many small record
accesses, almost all of which are read-only; updates,
reports and backup are executed during system off-line time.
- Data logging - up to several hundred sites, many very
small random write accesses (typically 20-100 bytes),
reports and backup are executed during system off-line time.
- Archiving - few users with large sequential read/write
accesses.
* Degraded subsystem performance
When the subject of the tests is a RAID-class storage system, the
user can disable one of the subsystem elements, and then proceed
to measure the degraded performance in this hindered setup.
* Design constraints and limitations
- Free disk space
In order for the RAIDmark utility to conduct the tests, it
should be able to create files of several megabytes each, or
even larger, if caching effects are to be measured
accurately. For best accuracy, at least 150MB of free disk
space should be available on the tested drive.
- Operating system interface
In a multi-processing environment, the benchmark should be run
as the sole active process; otherwise time measurements will be
of little value.
- Integrity and stability
The ability of RAID-class devices to operate with parts of
the subsystem inoperable (or altogether missing) might
create a situation in which the user is not aware that the
system is operating under highly unfavorable conditions
which adversely affect the benchmark. Moreover, modern disk
controllers can overcome some physical mishaps, such as bad
sectors, unaligned heads, etc., relying on firmware logic to
correct these problems at the expense of degraded
performance. The user should identify and isolate such
conditions, possibly by testing for stability over time and
over different areas of the disk platters.
- DOS FAT Latency
When using top-level MS-DOS file I/O requests, each random
file seek requires the operating system to scan the file
allocation table (FAT) from the location pointing to the
beginning of the file, to the location pointing to the
requested cluster. Although the FAT is usually kept in RAM,
for large files this sequential scanning process may
actually take longer than the physical movement of the disk
arm. Some advanced applications (database managers, etc.)
overcome this delay by keeping several logical entry points
to the file ("file handles") scattered at different points
in the file. Access is then achieved using the nearest file
handle, reducing FAT latency by a factor of 10 to 20. Other
applications, such as MS-Windows 3.0/3.1 Virtual Memory
Manager, implement their own file system to achieve zero FAT
latency. More modern operating systems, such as OS/2 HPFS,
achieve the same favorable results using advanced built-in
random seek methods.
For the sake of being as general as possible, RAIDmark
measures the exact FAT latency on the tested system by
comparing the average access times for clusters located in a
limited region near the beginning of the file to those of
clusters located in a similar size region near the end of
the file. The average FAT latency is then deducted from all
random access test times, giving the theoretical limit for
an 'optimally tuned' application.
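
A sketch of this estimate, under stated assumptions (the region size,
sample count, record size and stdio-based timing are all illustrative),
might look as follows in C:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SAMPLES    64
    #define REGION     (1024L * 1024L)          /* 1MB probe region */
    #define RECORD_SZ  64

    /* Average time of one random seek+read inside [region_start, +REGION). */
    static double avg_access(FILE *f, long region_start)
    {
        char rec[RECORD_SZ];
        clock_t t0 = clock();
        int i;

        for (i = 0; i < SAMPLES; i++) {
            long off = region_start + (rand() % (REGION / RECORD_SZ)) * RECORD_SZ;
            fseek(f, off, SEEK_SET);
            if (fread(rec, 1, RECORD_SZ, f) != RECORD_SZ) {
                perror("fread");
                exit(1);
            }
        }
        return (double)(clock() - t0) / CLOCKS_PER_SEC / SAMPLES;
    }

    /* Estimated extra delay per access caused by the sequential FAT scan:
     * accesses near the end of the file pay the full scan, accesses near
     * the beginning pay almost none. */
    double fat_latency(FILE *f, long file_size)
    {
        return avg_access(f, file_size - REGION) - avg_access(f, 0L);
    }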
* Test files
Test files are homogeneous streams of random values, logically
accessed as an array of 32-bit unsigned integers. The stream is
logically subdivided into records of 16 double-words
(long-words), each containing an embedded checksum.
Each record in a test file contains 16 double-words of 32 bits
each. Fifteen of these double-words are generated by a
pseudorandom number generator. The 16th value is calculated in
such a way that the sum of all words in the record plus the
offset of the record from the beginning of the file (in bytes)
equals zero (modulo 2^32). This file structure enables the user
to create files that are incompressible by nature, but still
allows the user to verify the validity of the data with an
error-detection probability of near certainty. However, this
method requires one multiplication and two addition operations
per 32-bit integer during file creation, but only one addition
per double-word during read access.
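
A minimal C sketch of this record format follows. The position of the
checksum word is treated as illustrative, since the text describes it
as the 16th value here and as the first four bytes of the record in the
'Build random block' section below; the LCG constants shown are
ordinary textbook values, not necessarily those used by RAIDmark.

    #include <stdint.h>

    #define WORDS_PER_RECORD 16

    static uint32_t lcg_state = 1u;

    /* Linear-congruential pseudorandom generator (see Knuth, Vol. 2). */
    static uint32_t lcg_next(void)
    {
        lcg_state = lcg_state * 1664525u + 1013904223u;
        return lcg_state;
    }

    /* Build one record so that the sum of its 16 words plus the record's
     * byte offset in the file is zero modulo 2^32. */
    void build_record(uint32_t rec[WORDS_PER_RECORD], uint32_t byte_offset)
    {
        uint32_t sum = byte_offset;
        int i;

        for (i = 1; i < WORDS_PER_RECORD; i++) {   /* 15 pseudorandom words */
            rec[i] = lcg_next();
            sum += rec[i];
        }
        rec[0] = (uint32_t)0u - sum;               /* checksum word */
    }

    /* Verify a record read back from the file; returns 1 when intact. */
    int check_record(const uint32_t rec[WORDS_PER_RECORD], uint32_t byte_offset)
    {
        uint32_t sum = byte_offset;
        int i;

        for (i = 0; i < WORDS_PER_RECORD; i++)
            sum += rec[i];
        return sum == 0u;
    }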
To prevent meaningless results obtained from slow processor
machines with concurrent disk access, a 32KB block of data (512
records of 64 bytes) is prepared in advance, and records from
this block are written to the disk at file creation, with only
one word in each 16-word record changed for each write operation
to account for different file offsets. This will give even a slow
CPU ample time to complete data preparation before the next write
operation is due.
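
The per-write adjustment amounts to shifting the checksum word of every
record by the change in file offset, as in the following sketch (same
assumed record layout as above):

    #include <stdint.h>

    #define RECORD_WORDS      16
    #define RECORDS_PER_BLOCK 512                 /* 512 x 64 bytes = 32KB */

    /* 'block' holds checksums valid for file offset 'old_base'; adjust
     * them so they become valid for offset 'new_base' instead. */
    void fixup_block_offsets(uint32_t block[RECORDS_PER_BLOCK * RECORD_WORDS],
                             uint32_t old_base, uint32_t new_base)
    {
        uint32_t delta = new_base - old_base;     /* wraps modulo 2^32 */
        int r;

        for (r = 0; r < RECORDS_PER_BLOCK; r++)
            block[r * RECORD_WORDS] -= delta;     /* same checksum slot as above */
    }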
Theoretically, data created by the method described above can be
compressed. However, there is currently no compression algorithm
capable of detecting the 256Kbit-long patterns created using this
method.
A question arises as to how well this random-data file serves as
a test case, since in reality most data files are compressible by
30%-60% or, in some homogeneous data acquisition applications, by
up to 80%. Using compressible data, however, exposes the benchmark
to the vast range of compression algorithms, each with its own
peculiarities regarding specific data patterns, sometimes causing
a twofold difference in throughput for a particular instance.
Hence, the approach taken in RAIDmark is the worst-case
compression scenario, in which all algorithms behave in a similar
fashion, namely, no compression occurs.
In a future version the user may be given the option of
generating semi-random ("pink noise") data files, permitting the
compression mechanism to affect results.
- Test File Creation
If any of the disk tests is requested, a 128MB file is
created. This process is timed as accurately as possible,
both the initial file open operation, and the writing of
data on the disk. Measurement is done separately for chunks
of data of varying sizes. The first two chunks are 16KB
long, the third is 32KB long, the fourth -- 64KB, and so on,
with the 8th chunk being 1MB long. The remaining 1MB chunks
are written to bring the total to the final size.
The 128MB file size was chosen arbitrarily, being larger
than today's biggest disk cache units, but less than half
the size of modern high capacity disks. This is a default
value, intended to become the 'RAIDmark standard,' but values
ranging from 1MB to 1GB can be selected using a command line
option. Test files smaller than 16MB might produce inaccurate
results, and the user is warned if such a size is selected. Using
the default
file size, each separate test will take between a few
seconds and two minutes to complete, permitting accurate
measurement without seizing the system for an intolerable
duration. There are 14 separate tests in a full test-suite,
so with all the testing overhead, running the full benchmark
may take the better part of an hour.
- Determine deferred-write cache size
During the file creation phase, effective transfer rate
throughput is measured for increasing data chunks as
described above. If the system contains a deferred-write
cache, the first few chunks will take a very short time to
be written, usually exhibiting a throughput in excess of 10
megabytes per second. At some point, throughput will decline
sharply. This point in the data flow marks the end of the
posted write cache. With systems employing dynamic caching,
the value obtained for the cache size may be arbitrary, but
if the test is run on an otherwise idle system, the size
detected will be close to the maximum possible for that
system.
Note that in order for this test to run accurately, the
system should be given enough time to "rest" before the test
commences, so that previously cached data can be flushed to
disk by the cache logic. The user should wait
until all previous disk activity ceases before starting this
test.
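
Under the same caveats, the knee detection can be sketched in C as
follows; the tenfold throughput-drop threshold, chunk counts and
stdio/clock() timing are assumptions rather than RAIDmark internals:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define KB 1024L
    #define MB (1024L * KB)

    /* Time one write of 'bytes' bytes, in seconds. */
    static double time_write(FILE *f, const char *buf, long bytes)
    {
        clock_t t0 = clock();
        if (fwrite(buf, 1, (size_t)bytes, f) != (size_t)bytes) {
            perror("fwrite");
            exit(1);
        }
        fflush(f);                                /* push data past the C runtime */
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    /* Estimate the deferred-write cache size, in bytes. */
    long estimate_write_cache(FILE *f)
    {
        static char chunk[1 * MB];
        long sizes[8] = { 16*KB, 16*KB, 32*KB, 64*KB, 128*KB, 256*KB, 512*KB, 1*MB };
        double first_rate = 0.0;
        long written = 0;
        int i;

        memset(chunk, 0xA5, sizeof(chunk));       /* placeholder data pattern */
        for (i = 0; i < 8 + 126; i++) {           /* 8 growing chunks, then 1MB each */
            long n = (i < 8) ? sizes[i] : 1 * MB;
            double t = time_write(f, chunk, n);
            double rate = (t > 0.0) ? (double)n / t : 1e12;
            if (i == 0)
                first_rate = rate;
            else if (rate < first_rate / 10.0)    /* sharp decline: cache is full */
                return written;
            written += n;
        }
        return written;                           /* no knee detected */
    }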
- Determine raw transfer rate for write
After obtaining an upper bound for the posted-write cache
size, the actual non-cached transfer rate for write
operations is measured by first writing data to a different
location on the drive. The amount of data written should
exceed the size of the computed deferred-write cache. The
benchmark then compares total write times for data chunks of
different sizes, specifically 8KB and 32KB blocks. By
subtracting the two, all fixed-overhead delays are canceled
out, and the net result is the time it takes to write 24KB
of data.
This figure for transfer rate will usually be lower than the
nominal figure quoted by the drive manufacturer, since the
measurement involves drive head movements (seek and search
operations), operating system and BIOS overhead, and
cache-miss delays. However, the figure obtained this way is
a more meaningful measure of subsystem performance, since
this is the actual throughput as experienced by the
application and the end user. Delays created by the
benchmark utility itself are taken into account and removed
from the measurement.
During the "post-cache" phase of writing, each of the 1MB
chunks of data should take approximately the same time to
complete. If the variability is higher than expected, the
user will be warned of the system's instability. This
situation may arise when other processes in the system or
the network are active, when the cache strategy involves
some exotic algorithm, when the disk is highly fragmented,
when many bad sectors are scattered on the disk's test area,
or when a "flaky" drive forces some operations to be retried.
The user has the choice of repeating the
test, or else referring to the user guide for
troubleshooting advice.
- Determine raw transfer rate for read
Transfer rate for sequential read access is measured by
comparing total read times for data chunks of different
sizes, specifically 8KB and 32KB blocks. By subtracting the
two, all fixed-overhead delays are canceled out, and the net
result is the time it takes to read 24KB of data.
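
The subtraction technique used for both the read and the write
measurements can be sketched as follows; a real run would repeat the
measurement over many chunks and use a finer timer than clock(), so
the calls and names here are purely illustrative:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define KB 1024L

    /* Time one sequential read of 'bytes' bytes, in seconds. */
    static double time_read(FILE *f, char *buf, long bytes)
    {
        clock_t t0 = clock();
        if (fread(buf, 1, (size_t)bytes, f) != (size_t)bytes) {
            perror("fread");
            exit(1);
        }
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    /* Net sequential read rate in KB per second: the 8KB timing is
     * subtracted from the 32KB timing, cancelling fixed per-call
     * overhead and leaving the time for the extra 24KB of data. */
    double net_read_rate(FILE *f)
    {
        static char buf[32 * KB];
        double t8  = time_read(f, buf, 8 * KB);
        double t32 = time_read(f, buf, 32 * KB);

        return 24.0 / (t32 - t8);
    }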
- Main test executive
The main test executive procedure is a generic routine whose
input parameters define the real-life environment each test
case is supposed to simulate. Parameters include:
- record size
- distribution of access type
- number of users or processing threads
The procedure determines the number of times the tests are
to be repeated as a function of the duration of a single
test. The rule is that, by default, each test case should
take from a few seconds to around two minutes. If the total
desired testing time of the complete benchmark is given on
the command line, the duration of each single test is
adjusted accordingly. This time does not include benchmark
overhead, such as file creation and cache overflowing, hence
total benchmark duration may be considerably longer than the
command line parameter.
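
The kind of parameter table such an executive might consume, together
with the repetition rule, is sketched below. The field names, the
sample figures in the table and the time limits are illustrative
assumptions; only the underlying ideas (record size, access mix,
thread count, duration-driven repetition) come from the description
above.

    struct scenario {
        const char *name;
        long        record_bytes;     /* size of each logical record         */
        int         pct_reads;        /* read share of all accesses, %       */
        int         pct_sequential;   /* sequential share of all accesses, % */
        int         threads;          /* simulated users / threads           */
    };

    static const struct scenario scenarios[] = {
        /* name             record  %rd  %seq  threads  (figures are examples) */
        { "workstation",      8192,  85,   60,      5 },
        { "transaction",       128,  70,   20,     25 },
        { "multimedia",      65536,  50,   95,      2 },
        { "database query",    128,  98,   10,     50 },
        { "data logging",       64,   5,    5,    100 },
        { "archiving",       65536,  50,   99,      2 },
    };

    /* Repeat a test case enough times to last at least min_seconds, but
     * cap the count so it does not run much beyond max_seconds. */
    long pick_repetitions(double one_pass_seconds,
                          double min_seconds, double max_seconds)
    {
        long reps;

        if (one_pass_seconds <= 0.0)
            return 1;
        reps = (long)(min_seconds / one_pass_seconds) + 1;
        if ((double)reps * one_pass_seconds > max_seconds)
            reps = (long)(max_seconds / one_pass_seconds);
        if (reps < 1)
            reps = 1;
        return reps;
    }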
- Build random block
This routine creates a 32KB block in memory for writing onto
the disk during the file creation phase. The format of the
block is described above. The algorithm used for generating
random numbers is the Linear-Congruential method (described
in D.E. Knuth, "The Art of Computer Programming" Vol. 2).
The checksum word of each record is stored in the first four
bytes of that record. To account for different file offsets,
this word is updated each time the 32KB block is written.
Conclusion
In order to help the user select a storage system and configuration
that will match their computing needs, a standard reference benchmark
for measuring storage system performance has been defined. Existing
disk benchmarking programs fall short when measuring modern 'smart'
storage systems, or when unusual disk configurations are used. Many of
the most important performance factors are not taken into account, and
the characteristics of the user's application don't play any role in
the measurement when these utilities are used. The DynaTek RAIDmark
has been created to overcome these limitations, operating as an
application-level benchmark program. RAIDmark measures storage system
performance in various typical environment conditions, and provides
the user with a good indication of the expected usefulness of their
system.
* Technical reference
- Patterson, David A.; Gibson, Garth; Katz, Randy H.:
  "A Case for Redundant Arrays of Inexpensive Disks (RAID)";
  ACM SIGMOD, Chicago, 1988.
- Patterson, David A.; Chen, Peter; Gibson, Garth; Katz, Randy H.:
  "An Introduction to Redundant Arrays of Inexpensive Disks (RAID)";
  IEEE, 1989.
- Graham, Richard A. (Integra Technologies, Inc.):
  "Redundant Arrays of Inexpensive Disks"; 1991.
- Finney, Kenneth C.; Graham, Richard A.:
  "A Discussion of RAID Technology"; DynaTek/Integra, 1992.
DynaTek and RAIDmark are trademarks of Dynatek Automation Systems Inc.
HPFS, IBM, Integra, MS-DOS, MS-Windows, OS/2 and PS/2 are registered
trademarks of their respective holders.